Welcome to the World of Pokemon

library(datasets)

First lets bring in the Pokemon dataset.

Pokemon = read.csv(file = 'data/Pokemon.csv')
head(Pokemon)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False



Reasearch Question 1:

What is the most common type for pokemon?

Since there are secondary typing for Pokemon, we will have to combine the observations from the two variables.


But first lets see what the data shows for Pokemon that only have one type.

# this is the Type.2 for charmander
# a Pokemon that has a singular "Fire" type

Pokemon$Type.2[5]
## [1] ""

The data displays a “” or empty for a Pokemon that has to Type.2. This will muddy our data if we’re looking for the most common type, since technically “” is a valid observation as far as the code is concerned.


These are the tables for Type.1 and Type.2 variables.

table(Pokemon$Type.1)
## 
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       69       31       32       44       17       27       52        4 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##       32       70       32       24       98       28       57       44 
##    Steel    Water 
##       27      112
table(Pokemon$Type.2)
## 
##               Bug     Dark   Dragon Electric    Fairy Fighting     Fire 
##      386        3       20       18        6       23       26       12 
##   Flying    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic 
##       97       14       25       35       14        4       34       33 
##     Rock    Steel    Water 
##       14       22       14
# how many types are in Type.1
length(table(Pokemon$Type.1)) 
## [1] 18
# how many types are in Type.2
length(table(Pokemon$Type.2))
## [1] 19

As you can see there is an extra type in table(Pokemon$Type.2) that is “” with the number 386 to represent all of the observations/Pokemon with “” for Type.2.


This is the combined table of Type.1 and Type.2, minus the empty “” type Now we need to be careful, because this does not represent every Pokemon. As you can see when we look at the sum of the table, it shows 1214 when we only have 800 observations/Pokemon.

combinedTypeTable = (table(Pokemon$Type.1) + table(Pokemon$Type.2)[-1])


paste("The sum is: ", sum(combinedTypeTable))
## [1] "The sum is:  1214"

Finally lets see which type is the most common for pokemon, either as their Type.1 OR their Type.2.

# finds the max(table), and then finds the index of the max value, then returns both the number and its name, in our case the type

combinedTypeTable[which(combinedTypeTable == max(combinedTypeTable))]
## Water 
##   126

As you can see, the most common typing for a Pokemon is Water.

But as always a visualization will help in contextualizing this piece of information.

combinedTypeTable = sort(combinedTypeTable, decreasing=TRUE)
barplot(combinedTypeTable[1:5], main = "TOP 5 MOST COMMON TYPES IN POKEMON", ylab="Frequency", col="lightblue")



Reasearch Question 2:

What generation has the greatest number of each “type”(Type.1 and Type.2 combined) of Pokemon


First we need to know how many different generations are present in our data.

table(Pokemon$Generation)
## 
##   1   2   3   4   5   6 
## 166 106 160 121 165  82

There are 6 different generations of Pokemon listed in the data, with a varying number of Pokemon for each generation.

Now we need to parse the data so that we can see how many of each type is in each generation.

# table for the number of each Type.1 type for all entries where the Generation is 1
table(Pokemon$Type.1[which(Pokemon$Generation == 1)])
## 
##      Bug   Dragon Electric    Fairy Fighting     Fire    Ghost    Grass 
##       14        3        9        2        7       14        4       13 
##   Ground      Ice   Normal   Poison  Psychic     Rock    Water 
##        8        2       24       14       11       10       31
# table for the number of each Type.2 type for all entries where the Generation is 1
table(Pokemon$Type.2[which(Pokemon$Generation == 1)])
## 
##              Dark   Dragon    Fairy Fighting   Flying    Grass   Ground 
##       88        1        1        3        2       23        2        6 
##      Ice   Poison  Psychic     Rock    Steel    Water 
##        3       22        7        2        2        4
length(table(Pokemon$Type.1[which(Pokemon$Generation == 1)]))
## [1] 15
length(table(Pokemon$Type.2[which(Pokemon$Generation == 1)]))
## [1] 14

But we run in to a problem due to the way that Pokemon types are set up. A Pokemon has one or two types, and the Type.2 is valued the same as Type.1. This means that we need to count both when looking at how many types of each Pokemon are in each generation. For example the first Pokemon in the data, Bulbasaur, would count as both a “Grass” and “Poison” type, with both holding equal weight. This causes discrepancies like we see above, where some types are missing from the Type.2 table and the Type.1 table . We will have to modify the table ourselves to make this work.


Since we will most likely have to do this for every generation, lets make a function that fills in the missing “types” and returns one vector with the combined values for both Type.1 and Type.2

# this function will help us combine the two Type tables int o one table with all unique types from both Type.1 and Type.2, as well as add up duplicates


combineTypeTblByGen = function(generation){
  
  # merging the table of Type.1 and Type.2 for the generation
  mergedTypeTable = merge(table(Pokemon$Type.1[which(Pokemon$Generation == generation)]), table(Pokemon$Type.2[which(Pokemon$Generation == generation)])[-1], all= TRUE)
  
  
  # vector version of the merged tables above
  typeVec = mergedTypeTable[[2]]
  names(typeVec) = mergedTypeTable[[1]]
  
  
  
  # for loop that adds up the repeat values for the types, and then removes the duplicate name+value from the vector
  len = length(typeVec) - 1
  for(i in 1:len){
  
    if(i >= length(typeVec)){break}
  
    else if(names(typeVec[i]) == names(typeVec[i + 1])){

      typeVec[i+1] = typeVec[i] + typeVec[i+1]
      #print(typeVec[i])

      typeVec = typeVec[-c(i)]
  
    }

  }
  
  return(typeVec)
  
  
}


barplot(combineTypeTblByGen(1), las=2, main="Frequency of Types for Generation 1")

The plot above combines frequency of “types” in both Type1 and Type2 for generation 1, and because we have the function combineTypeTblByGen() we are able to replicate this for every other generation as well.


Since we have the combined tables for every generation, we can move on to comparing the generations by Type.

This next function will allow us to see how many of one specific type is in every generation. For example it will allow us to see the number of “Bug” types there are in generation 1 through 6.

# function takes in a character argument that represents a Pokemon Type and returns a vector of length 6, with each index representing the number of that type of pokemon are in that generation number

compareTypeAcrossGen = function(type){
  
  oneTypeAllGensVec = c()
  
  for(i in 1:6){
    
    #print(combineTypeTblByGen(i)[type])
    
    oneTypeAllGensVec[i] = combineTypeTblByGen(i)[type]
   
    
  }

  names(oneTypeAllGensVec) = c("Gen 1", "Gen 2", "Gen 3", "Gen 4", "Gen 5", "Gen 6")
  
  return(oneTypeAllGensVec)

}


compareTypeAcrossGen("Bug")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##    14    12    14    11    18     3
compareTypeAcrossGen("Dark")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6 
##     1     8    13     7    16     3
barplot(compareTypeAcrossGen("Bug"), main="Frequency of Bug Types in Each Generation(Gen)", las=2, col = "lightgreen")

We have the comparisons across generations for individual types, but that doesn’t accomplish anything unless we want to create 18 different barplots. Instead lets use the plotly package to create an interactive barplot that shows the frequency of each type across the generations to get a broader understanding of the dataset.

q2FinalGraph

Lets do some descriptive stats –> then well do hypothesis testing

head(Pokemon)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80
## 3  3              Venusaur  Grass Poison   525 80     82      83     100
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 5  4            Charmander   Fire          309 39     52      43      60
## 6  5            Charmeleon   Fire          405 58     64      58      80
##   Sp..Def Speed Generation Legendary
## 1      65    45          1     False
## 2      80    60          1     False
## 3     100    80          1     False
## 4     120    80          1     False
## 5      50    65          1     False
## 6      65    80          1     False